Interactive Maps and Geodemographic Clustering with the K-Means Classification Method
Author
Jay T. Pentapati
Published
January 10, 2024
Note:
Please click the ‘Go Back’ button to return to the central page
1 Installing Packages
The code below checks whether the packages listed in the variable (packages_required) are installed. Any missing packages are installed automatically, and all required packages are then loaded for the analysis.
Code for Installing all required R packages.
packages_required <- c("tidycensus", "tigris", "tidyverse", "data.table", "sf", "tmap",
                       "tmaptools", "readr", "geojsonsf", "osmdata", "devtools",
                       "RColorBrewer", "classInt", "R.utils", "dplyr", "ggplot2",
                       "viridis", "raster", "terra", "exactextractr", "tidyterra",
                       "tibble", "rosm", "spdep", "cluster", "GGally", "rgeoda",
                       "tidyr", "gridExtra", "patchwork") # include any additional packages needed in this variable

new_req_packages <- packages_required[!(packages_required %in% installed.packages()[, "Package"])] # names of packages not yet installed

if (length(new_req_packages)) install.packages(new_req_packages) # install the missing packages

for (i in 1:length(packages_required)) {
  library(packages_required[i], character.only = TRUE) # load every package named in packages_required
}
# The message below is separately coded output confirming the packages are installed, not the output of the loop above
[1] “All required packages for Assignment-2 are installed and loaded properly.”
2 Introduction
2.1 Rationale for CRS
According to the USGS website, maps in the USA use projections from the State Plane Coordinate System (United States Geological Survey, 2023). It is a plane coordinate system in which each state is divided into up to six zones, depending on the geometry of the state (United States Geological Survey, 2023). These zones use a Transverse Mercator or Lambert conformal conic projection, which preserves shape and scale over smaller areas such as ZCTAs or tracts.
By looking at the Map of USA State Plane Zones NAD83 on the ArcGIS website (ArcGIS Hub, 2020), we can identify that the Bay Area falls in California Zone 3. We therefore use EPSG:2227 – NAD83 California zone 3 (ftUS).
2.2 Description of Data
Airbnb listings data for San Francisco, Oakland and San Mateo are extracted from the Inside Airbnb website. The variables chosen from the ACS socio-economic metrics are Median Household Income (B19013_001E) and the number of people aged 18-64 who have a computer and a broadband internet subscription (B28005_011E). We also use the Bay Area ZIP Code Areas shapefile.
As additional datasets, we use the number of light-duty vehicles per ZIP code (2020) and the average commercial electricity rate per ZIP code (2020), covering both investor-owned and non-investor-owned utilities.
Note: An API key for Census data can be obtained by signing up at http://api.census.gov/data/key_signup.html. We then use the tidycensus package to retrieve ACS data from the US Census Bureau.
Code for setting up API Key
census_api_key(J_API_KEY, install = TRUE, overwrite = TRUE) # replace 'J_API_KEY' with the API key obtained from the Census Bureau
# The message below is separately coded output checking the system environment variable (CENSUS_API_KEY), not the output of the line above
[1] “API key to get Census data is successfully set up.”
2.3 Data Extraction and Pre-Processing
We make two separate API calls to retrieve Median Household Income (B19013_001E) and the number of people aged 18-64 who have a computer and a broadband internet subscription (B28005_011E) from the 2016-2020 ACS. This approach is more efficient and straightforward than a single API call with both variables, which can return a spatial or non-spatial file with a long-format attribute table that would then need to be converted into wide format for analysis.
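To illustrate the long-vs-wide distinction, here is a minimal sketch of reshaping a long-format table with tidyr::pivot_wider(). The values and GEOIDs below are hypothetical, not actual ACS output:

```r
library(tidyr)

# Hypothetical long-format result of a single API call with two variables:
# one row per (GEOID, variable) pair
acs_long <- data.frame(
  GEOID    = c("94110", "94110", "94530", "94530"),
  variable = c("B19013_001", "B28005_011", "B19013_001", "B28005_011"),
  estimate = c(120000, 25000, 95000, 14000)
)

# Wide format: one row per GEOID, one column per variable
acs_wide <- pivot_wider(acs_long, names_from = variable, values_from = estimate)
acs_wide
```

Making two calls and joining by GEOID, as done below, yields the wide layout directly without this extra reshaping step.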
We extract values of the two variables for the entire US, and later filter the ZIP codes specific to the Bay Area. This is because some California ZCTAs extend into other states, which causes errors when extracting ZCTA-level data for specific states (‘zip - Determining which US zipcodes map to more than one state or more than one city? - Geographic Information Systems’, 2021). According to the US Census Bureau, all ZCTA codes are valid US ZIP codes (United States Census Bureau, 2023).
Spatial resolution: ZCTA level
Time: 2020
Cleaning and pre-processing Data
median_hh_income <- get_acs(   # get_acs retrieves ACS survey data from the US Census Bureau
  geography = "zcta",          # geographic unit: ZCTA (alternatives include tract, county, state)
  variables = "B19013_001E",   # Median Household Income
  year = 2020,                 # ACS data collected 2016-2020
  geometry = FALSE             # geometry comes from the Berkeley Library Bay Area ZIP shapefile, not ACS
) %>%
  rename(estimate_MedianHhI = estimate) %>% # rename the retrieved estimate to an appropriate name
  select(GEOID, estimate_MedianHhI)         # keep only GEOID (ZCTA code, also ZIP code) and the estimate, ignoring MOE

# Retrieving the Internet Use variable
internet_use <- get_acs(
  geography = "zcta",
  variables = "B28005_011E",   # people aged 18-64 with a computer and broadband internet subscription
  year = 2020,
  geometry = FALSE
) %>%
  rename(estimate_InternetUse = estimate) %>%
  select(GEOID, estimate_InternetUse)

us_zip <- left_join(median_hh_income, internet_use, by = "GEOID") # join the two datasets by GEOID to get the required format
Code for Showing first six rows of Data being used
We downloaded the Bay Area listings data (San Francisco, Oakland and San Mateo) from the Airbnb website and the Bay Area ZIP Code Areas from the Berkeley Library. All .csv files are loaded with the read.csv function from the utils package (the loading process can be found in the source code of this file).
Two Additional Datasets:
We will be using a commercial electricity rates dataset per ZIP code, covering both investor-owned and non-investor-owned utilities.
Code for data cleaning and pre-processing of electricity rates data
non_iou_data_c <- non_iou_data %>%  # data cleaning for the additional datasets
  select(zip, comm_rate) %>%        # keep only zip and commercial rate from non-investor-owned utilities
  group_by(zip) %>%                 # group by ZIP
  summarise(Avg_Non_IOU_Comm_Rate = mean(comm_rate, na.rm = TRUE)) # mean across non-investor-owned utilities per ZIP

iou_data_c <- iou_data %>%
  select(zip, comm_rate) %>%        # keep only zip and commercial rate from investor-owned utilities
  group_by(zip) %>%
  summarise(Avg_IOU_Comm_Rate = mean(comm_rate, na.rm = TRUE))     # mean across investor-owned utilities per ZIP

electric_u <- non_iou_data_c %>%
  full_join(iou_data_c, by = "zip") %>% # join investor-owned and non-investor-owned utility datasets by ZIP
  mutate(
    Avg_IOU_Comm_Rate = replace_na(Avg_IOU_Comm_Rate, 0),         # replace NA values with 0
    Avg_Non_IOU_Comm_Rate = replace_na(Avg_Non_IOU_Comm_Rate, 0), # replace NA values with 0
    Combined_Avg_Comm_Rate = (Avg_Non_IOU_Comm_Rate + Avg_IOU_Comm_Rate) / 2 # average as the combined commercial rate
  ) %>%
  select(zip, Combined_Avg_Comm_Rate) %>% # keep only ZIP and the new statistic
  rename(ZIP = zip) %>%
  mutate(ZIP = as.character(ZIP))

vc_califorina <- vehicle_count_2020 %>%
  select(-Date, -Model.Year, -Fuel, -Make) %>%    # keep only the Zip.Code, Vehicles and Duty columns
  filter(Duty == "Light" & Zip.Code != "OOS") %>% # light-duty vehicles only; remove Out of Service (OOS) locations
  group_by(Zip.Code) %>%
  summarise(No_of_Light_Duty_Vehicles = sum(Vehicles)) %>% # sum light-duty vehicles per ZIP
  rename(ZIP = Zip.Code)
3 Methodology
3.1 No. of Airbnb Listings and Mean Price per ZIP Code Area
The Bay Area Airbnb listings we downloaded have longitude and latitude coordinates, so we initially set the CRS of the listings to EPSG:4326 (WGS84). We then re-project into EPSG:2227 – NAD83 California zone 3 (ftUS).
We then perform a spatial join of the Bay Area ZIP Code Areas and the Airbnb listings, which produces an overlay shapefile. This lets us understand the geographical distribution of Airbnb listings and their characteristics across Bay Area ZIP Code Areas.
Code for pre-processing of Shapefiles
listings_bayarea_sf <- listings_bayarea %>%
  st_as_sf(coords = c("longitude", "latitude")) %>% # create a point shapefile from the coordinates
  st_set_crs(4326)                                  # set CRS to WGS84 (EPSG:4326), as these are long/lat coordinates

listings_bayarea_sf_transform <- st_transform(listings_bayarea_sf, crs = 2227) # re-project to the project CRS

overlay_listings <- st_join(bayarea_zipcodes, listings_bayarea_sf_transform) # spatial join of the Bay Area ZIP shapefile and the listings

st_crs(overlay_listings) == st_crs(listings_bayarea_sf_transform) # verify the CRS of the new shapefile
[1] TRUE
We need to plot the number of Airbnbs and their mean price per ZIP code in the Bay Area. To derive these variables, we group by ZIP code, count the non-NA price rows of the listings in each ZIP code, and apply mean() to find the mean price of those listings.
listings_bayarea_zip <- overlay_listings %>% # new dataframe: number of Airbnbs and mean price per ZIP
  group_by(ZIP) %>%
  summarise(
    count_airbnb = sum(!is.na(price)),     # count non-NA price rows
    mean_price = mean(price, na.rm = TRUE) # mean price, ignoring NAs
  ) %>%
  filter(count_airbnb != 0)
listings_bayarea_zip1 <- listings_bayarea_zip %>%
  select(count_airbnb, everything())        # reorder attributes for the interactive map

listings_bayarea_zip2 <- listings_bayarea_zip %>%
  select(mean_price, everything()) %>%      # reorder attributes for the interactive map
  mutate(mean_price = round(mean_price, 1)) # round mean price to one decimal place
Based on summary statistics for both the count and mean price of Airbnbs per ZIP code: the minimum of the count variable is 1 and the first quartile is 71, so the first class is divided accordingly. The median of the count variable is 172 and the standard deviation is 162.5, which define the second and third classes. Finally, to identify outliers: as the third quartile is 257, the standard deviation is 162.5 and the maximum is 794, the last class has 500 as its lower boundary.
The minimum of the mean price variable is 71 and the first quartile is 154, so the first class is divided accordingly. The median of mean price is 204 and the standard deviation is 297.9, which define the second and third classes. Finally, to identify outliers: as the third quartile is 282.6, the standard deviation is 297.9 and the maximum is 1818, the last two classes have 1000 and 1800 as their lower boundaries respectively.
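The quartiles and standard deviations cited above can be recovered directly from the data; a brief sketch, using the variable names defined earlier in this document:

```r
# Summary statistics that informed the fixed class breaks
summary(listings_bayarea_zip$count_airbnb)          # min, quartiles, median, max of the count variable
sd(listings_bayarea_zip$count_airbnb, na.rm = TRUE) # standard deviation of the count variable

summary(listings_bayarea_zip$mean_price)            # same statistics for mean price
sd(listings_bayarea_zip$mean_price, na.rm = TRUE)
```

These values feed the breaks argument of tm_fill() in the mapping code below.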
New forms of spatial data such as Airbnb data can offer up-to-date snapshots of how people are using spaces and lead to real-time insights. These data generally show high heteroskedasticity, but they also provide detail at a granular level, down to individual Airbnbs. On the other hand, new forms of spatial data like Airbnb data never represent all segments of the market, and their collection can introduce measurement errors due to less standardised methodologies.
Airbnb data is dynamic and reflects real-time changes in the market, like social media or GPS data. Like social media or GPS data, Airbnb data is user-generated, which makes it quite different from census-type data.
Code for Maps of Count and Price of AIRBNBs per ZIP
tmap_mode("view") # interactive mode

map1.11 <- tm_shape(listings_bayarea_zip1) + # listings_bayarea_zip1 is listings_bayarea_zip re-ordered with count_airbnb first
  tm_fill("count_airbnb", style = "fixed", title = "No. of Airbnb's per ZIP", alpha = 1,
          breaks = c(1, 72, 172, 359, 500, Inf), palette = "-viridis") +
  tm_borders() +
  tm_text("ZIP", size = 0.3) +
  tm_view(set.view = c(center_lon, center_lat, zoom_level))

map1.12 <- tm_shape(listings_bayarea_zip2) + # listings_bayarea_zip2 is listings_bayarea_zip re-ordered with mean_price first
  tm_fill("mean_price", style = "fixed", title = "Mean Price of Airbnb per ZIP", alpha = 1,
          breaks = c(74, 155, 287, 501, 1000, 1800, Inf), palette = "-viridis") +
  tm_borders() +
  tm_text("ZIP", size = 0.3) +
  tm_view(set.view = c(center_lon, center_lat, zoom_level))

tmap_arrange(map1.11, map1.12)
3.2 Median Household Income and No. of People aged 18-64 who have a Computer and Internet Subscription per ZIP Code Area
We filter the Median Household Income and Internet Use variables to Bay Area ZIP codes, then join the ACS dataset to the Bay Area ZIP Code Areas shapefile. As an additional step, we remove all ZIP codes where the Internet Use variable is missing.
Code for Cleaning Attribute information in Shapefile
bay_area_zipcodes <- bayarea_zipcodes$ZIP    # vector of Bay Area ZIP codes

bayarea_zip_vars_data <- us_zip %>%          # filter the two ACS variables to Bay Area ZIP codes
  filter(GEOID %in% bay_area_zipcodes) %>%   # using the Bay Area ZIP codes vector created above
  rename(ZIP = GEOID)                        # rename GEOID to ZIP

bayarea_zip_vars <- bayarea_zipcodes %>%
  left_join(bayarea_zip_vars_data, by = "ZIP") %>%
  filter(!(is.na(estimate_InternetUse))) # join the two ACS variables to the Berkeley Library shapefile and drop NA values of the Internet Use variable
head(bayarea_zip_vars)
Simple feature collection with 6 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 5936840 ymin: 2232344 xmax: 6251037 ymax: 2506800
Projected CRS: NAD83 / California zone 3 (ftUS)
# A tibble: 6 × 8
ZIP PO_NAME STATE Area__ Length__ geometry
<chr> <chr> <chr> <dbl> <dbl> <MULTIPOLYGON [US_survey_foot]>
1 94558 NAPA CA 12313263537 995176. (((6102909 2377436, 6102846 23769…
2 95620 DIXON CA 7236949521. 441860. (((6230747 2302747, 6219259 23030…
3 95476 SONOMA CA 3001414165. 311319. (((6013411 2248863, 6013202 22488…
4 94559 NAPA CA 1194301745. 359105. (((6045939 2248058, 6044555 22480…
5 94533 FAIRFIELD CA 991786103. 200773. (((6146297 2299595, 6146371 22987…
6 94954 PETALUMA CA 2006544443. 267474. (((5998504 2235043, 5991182 22339…
# ℹ 2 more variables: estimate_MedianHhI <dbl>, estimate_InternetUse <dbl>
We remove all ZIP codes where Median Household Income variable is missing.
Summary of Median Household Income per Zipcode (estimate_MedianHhI):
Min. 1st Qu. Median Mean 3rd Qu. Max.
40813 89583 111037 122296 153256 250001
Summary of No. of People aged 18-64 who have a Computer and Internet Subscription
per Zipcode (estimate_InternetUse):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 8110 16235 17830 24914 59638
Standard Deviation of Variable(estimate_MedianHhI): 45357
Standard Deviation of Variable(estimate_InternetUse): 12765.53
Based on the summary statistics for both ACS variables per ZIP code, the range and standard deviation of each variable are large, so we use the Fisher-Jenks classification method, which is well suited to highly skewed data.
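The Fisher-Jenks breaks that tmap's style = "fisher" produces can be inspected beforehand with the classInt package (already listed in packages_required); a sketch, using the shapefile created above:

```r
library(classInt)

# Five Fisher-Jenks class breaks for Median Household Income per ZIP
fisher_breaks <- classIntervals(bayarea_zip_vars$estimate_MedianHhI,
                                n = 5, style = "fisher")
fisher_breaks$brks # the break points the choropleth classes will use
```

Comparing these breaks against the quartiles from summary() is a quick way to confirm the classification tracks the skew in the data.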
The Bay Area ZIP Code Areas form a mixed-use type of neighborhood. Most neighborhoods have medium household incomes and high internet subscription rates. Socio-economic indicators such as Median Household Income and Internet Use help delineate this neighborhood typology. San Francisco and Oakland have more affluent neighborhoods with high internet usage and medium household income, whereas San Mateo has neighborhoods with high household income and medium internet usage.
Based on this classification, a reasonable hypothesis is that Airbnbs would cluster in San Francisco and Oakland, due to high internet usage and medium household income (middle- to high-class residential areas). Airbnb listings would also cluster in areas with a low commercial electricity rate and a high number of vehicles.
Code for Maps of Median Household Income and Internet Use per ZIP
tmap_mode("plot")

map2.1 <- tm_basemap() +
  tm_shape(bayarea_zip_vars1) + # shapefile for map 2
  tm_fill(col = "estimate_MedianHhI", palette = "YlGn", n = 5, style = "fisher",
          title = "Median Household Income (in $ USD)") +
  tm_compass(position = c(0.85, 0.85)) +    # compass
  tm_scale_bar(position = c(0.625, 0.02)) + # scale bar
  tm_borders(lwd = 0.3) +                   # borders for the ZCTA polygons
  tm_layout(legend.title.size = 0.715, legend.text.size = 0.6,
            inner.margins = c(0.22, 0.13, 0.02, 0.08),
            legend.position = c(0.05, 0.02), legend.width = 0.5, bg.color = "#eaf5fa",
            main.title = "Median Household Income per ZIP Code", # map title within tm_layout
            main.title.size = 0.7, main.title.position = "center")

map2.2 <- tm_basemap() +
  tm_shape(bayarea_zip_vars) +
  tm_fill(col = "estimate_InternetUse", palette = "YlGn", n = 5, style = "fisher",
          title = "Broadband Internet Connection (People)") +
  tm_compass(position = c(0.85, 0.85)) +
  tm_scale_bar(position = c(0.625, 0.02)) +
  tm_borders(lwd = 0.3) +
  tm_layout(legend.title.size = 0.715, legend.text.size = 0.6,
            inner.margins = c(0.185, 0.13, 0.02, 0.08),
            legend.position = c(0.05, 0.02), legend.width = 0.5, bg.color = "#eaf5fa",
            main.title = "No. of People aged (18-64) using Broadband Internet per ZIP Code",
            main.title.size = 0.7, main.title.position = "center")
tmap_arrange(map2.1, map2.2)
3.3 MAP 3: AIRBNB Listings and Internet Use Variable (Combining Datasets)
We change the order of attributes in the Bay Area ZIP Codes as follows:
We calculate the natural logarithm of the price of Airbnb listings in the overlay shapefile and check the CRS of both shapefiles:
Code for Calculating Log prices of AIRBNBs
listings_bayarea_sf_transform_copy <- listings_bayarea_sf_transform
listings_bayarea_sf_transform_copy$nl_price <- log(listings_bayarea_sf_transform_copy$price) # natural logarithm of price (log to base e)
st_crs(bayarea_zip_vars_map3) == st_crs(listings_bayarea_sf_transform_copy) # check the CRS of both shapefiles
[1] TRUE
We change the order of attributes in the overlay listings shapefile as follows:
Based on summary statistics for internet usage (people) per ZIP code and the natural logarithm of Airbnb price, the range and standard deviation are large, so we again use the Fisher-Jenks classification method, which suits highly skewed data.
Airbnb listings are mainly clustered in San Francisco, Daly City, South San Francisco, Burlingame, Foster City, Redwood City and Oakland (geographical observations from the map below). High values of the natural logarithm of price are found in San Francisco, the west-coast seaside and Redwood City (San Carlos).
Apparently, in San Francisco, ZIP code areas with high internet usage have many Airbnb listings, and ZIP code areas with around 5,500 people using the internet have high Airbnb prices.
Code for Maps of Internet Use per ZIP in San Francisco
tmap_mode("view") # activate interactive mode

map3 <- tm_shape(bayarea_zip_vars_map3) +
  tm_fill(col = "estimate_InternetUse", alpha = 1, palette = "YlGn", n = 5, style = "fisher",
          title = "Broadband Internet Connection (People)") +
  tm_borders() +
  tm_text("ZIP", size = 0.3) +
  tm_shape(listings_bayarea_sf_transform_copy) +
  tm_dots(col = "nl_price", palette = "Reds", n = 3, style = "fisher",
          title = "Natural Logarithm of Airbnb's Price") +
  tm_view(set.view = c(center_lon, center_lat, zoom_level)) # set the view for interactive mode

map3
3.4 MAP 4: Spatial Auto-Correlation
Calculating the list of neighbors (neighborhoods) and a queen-contiguity spatial weights matrix for the Bay Area ZIP code areas that have Median Household Income:
Code for calculating the spatial weights matrix for ZIP codes which have Median Household Income
nb_q1 <- poly2nb(bayarea_zip_vars1, queen = TRUE) # this shapefile has no NA values of the Median Household Income variable
w_queen1 <- nb2listw(nb_q1, style = "B", zero.policy = TRUE) # binary queen-contiguity weights for the neighbours list
isolates1 <- which(w_queen1$neighbours == "0")
bayarea_zip_vars1_NoZero <- bayarea_zip_vars1[-c(isolates1), ] # shapefile with no observations having zero queen-contiguity neighbours
Calculating the list of neighbors (neighborhoods) and a queen-contiguity spatial weights matrix for the Bay Area ZIP code areas that have Internet Usage:
Code for calculating the spatial weights matrix for ZIP codes which have Internet Use
nb_q2 <- poly2nb(bayarea_zip_vars, queen = TRUE) # this shapefile has no NA values of Broadband Internet Use
w_queen2 <- nb2listw(nb_q2, style = "B", zero.policy = TRUE)
isolates2 <- which(w_queen2$neighbours == "0")
bayarea_zip_vars_NoZero <- bayarea_zip_vars[-c(isolates2), ] # shapefile with no observations having zero queen-contiguity neighbours
3.4.1 Global Spatial Auto-Correlation:
Calculating the list of neighbors (neighborhoods) and a queen-contiguity row-standardised spatial weights matrix for the Bay Area ZIP code areas that have Median Household Income and/or Internet Usage (people):
Code for calculating the spatial weights matrix for ZIP codes which have Median Household Income or Internet Use
nb_q1 <- poly2nb(bayarea_zip_vars1_NoZero, queen = TRUE) # neighbours list after dropping zero-neighbour observations and NA values of Median Household Income
w_queen_std1 <- nb2listw(nb_q1, style = "W")             # queen-contiguity, row-standardised weights

nb_q2 <- poly2nb(bayarea_zip_vars_NoZero, queen = TRUE)  # same, for observations with no NA values of Broadband Internet Use
w_queen_std2 <- nb2listw(nb_q2, style = "W")             # queen-contiguity, row-standardised weights
Calculating Spatial Lag for Median Household Income and Internet Usage(People):
Code for calculating spatial lag for ZIP codes
bayarea_zip_vars1_NoZero$sl_MedianHhI <- lag.listw(w_queen_std1, bayarea_zip_vars1_NoZero$estimate_MedianHhI)  # spatial lag of Median Household Income in the Bay Area
bayarea_zip_vars_NoZero$sl_InternetUse <- lag.listw(w_queen_std2, bayarea_zip_vars_NoZero$estimate_InternetUse) # spatial lag of Internet Use in the Bay Area
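For intuition, under row-standardised weights the spatial lag of an observation is simply the mean of its neighbours' values. A toy sketch with a hypothetical three-area contiguity matrix (not the project data):

```r
library(spdep)

# Toy example: areas 1-2 and 2-3 are neighbours
W <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE)
lw <- mat2listw(W, style = "W") # row-standardised weights list
x  <- c(10, 20, 40)
lag.listw(lw, x) # each value becomes the mean of its neighbours: 20, 25, 20
```

The same mechanism, applied via w_queen_std1 and w_queen_std2, produces the lag variables above.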
Moran's I for Median Household Income in Bay Area ZIP Code Areas:
Code for calculating Moran's I statistic for Median Household Income
moran.mc(bayarea_zip_vars1_NoZero$estimate_MedianHhI, w_queen_std1, nsim = 1000, alternative = "greater") # Moran's I statistic for Median Household Income in the Bay Area
Monte-Carlo simulation of Moran I
data: bayarea_zip_vars1_NoZero$estimate_MedianHhI
weights: w_queen_std1
number of simulations + 1: 1001
statistic = 0.39557, observed rank = 1001, p-value = 0.000999
alternative hypothesis: greater
Moran's I for Median Household Income in the Bay Area is 0.396 with a p-value of 0.000999, so the observed spatial pattern is statistically significant. The p-value is the probability of obtaining a Moran's I value at least as large as the observed one under a random spatial distribution; if it is below 0.05, we reject the null hypothesis that the observed spatial pattern is due to random chance. In this sense Moran's I is analogous to the Chi-square statistic, comparing an observed pattern to an expected one. The spatial distribution of Median Household Income is more spatially concentrated than a random spatial distribution.
Moran's I for Internet Usage (People) in Bay Area ZIP Code Areas:
Code for calculating Moran's I statistic for Internet Use
moran.mc(bayarea_zip_vars_NoZero$estimate_InternetUse, w_queen_std2, nsim = 1000, alternative = "greater") # Moran's I statistic for Internet Use in the Bay Area
Monte-Carlo simulation of Moran I
data: bayarea_zip_vars_NoZero$estimate_InternetUse
weights: w_queen_std2
number of simulations + 1: 1001
statistic = 0.23313, observed rank = 1001, p-value = 0.000999
alternative hypothesis: greater
Moran's I for Internet Usage (People) in the Bay Area is 0.23313 and the p-value is 0.000999, so the observed spatial pattern is statistically significant. The spatial distribution of Internet Usage (People) is mildly more spatially concentrated than a random spatial distribution.
3.4.2 Local Spatial Auto-Correlation:
Calculating LISAs for Bay Area ZIP Code Areas for both Median Household Income and Internet Usage (People):
Code for calculating LISA statistics
lisa_perm1 <- localmoran_perm(bayarea_zip_vars1_NoZero$estimate_MedianHhI, w_queen_std1, nsim = 1000, alternative = "two.sided") # LISA clustering for Median Household Income in the Bay Area
lisa_perm2 <- localmoran_perm(bayarea_zip_vars_NoZero$estimate_InternetUse, w_queen_std2, nsim = 1000, alternative = "two.sided") # LISA clustering for Internet Use in the Bay Area
We calculate quadrants for each observation's local Moran's I, with the significance cutoff at 0.2. As there are no significant quadrants at the 0.1 level, the cutoff was chosen as the smallest value that yields levels for at least one variable:
Code for calculating LISA quadrants
quadrants1 <- hotspot(lisa_perm1, Prname = "Pr(z != E(Ii)) Sim", cutoff = 0.2) # quadrants from the significance values of local Moran's I for Median Household Income
quadrants2 <- hotspot(lisa_perm2, Prname = "Pr(z != E(Ii)) Sim", cutoff = 0.2) # quadrants from the significance values of local Moran's I for Internet Use
Levels of Quadrant 1 :
Low-Low, High-Low, Low-High, High-High
Levels of Quadrant 2 :
Low-Low, High-Low, Low-High, High-High
Adding Quadrants back to Bay Area ZIP Code Areas Shapefile:
Code for adding new categorical variable back to shapefile
bayarea_zip_vars1_NoZero$quadrant1 <- as.character(quadrants1) %>% replace_na("Not significant") # add the quadrants to the shapefile (no NA Median Household Income, no zero-neighbour observations)
bayarea_zip_vars_NoZero$quadrant2 <- as.character(quadrants2) %>% replace_na("Not significant")  # add the quadrants to the shapefile (no NA Internet Use, no zero-neighbour observations)
To view complete chunks, see Source code.
As there were no quadrants at a cutoff of 0.1, we set the cutoff to 0.2 to obtain a minimum number of quadrants for at least one variable. These LISA clusters are therefore likely to have arisen by random chance. In other words, the observed clustering at the whole Bay Area level has spatial structure, whereas the individual LISA clusters may be due to random chance: the observed spatial pattern in the LISA clusters is not statistically significant.
At the same time, it would be premature to conclude that the observed spatial pattern in the LISA clusters is purely random chance; this requires further and more advanced analysis.
We create an overlay shapefile by spatially joining the Bay Area ZIP Code Areas and the Airbnb listings, then group by ZIP to calculate all variables. We also join the average commercial electricity rate and the number of light-duty vehicles per ZIP code.
Code for calculating dataframe required for Geodemographic Classification
overlay_listings <- st_join(bayarea_zip_vars, listings_bayarea_sf_transform, join = st_intersects) # spatial join to create the overlay shapefile

overlay_listings_zip <- overlay_listings %>%
  group_by(ZIP) %>%
  summarise(
    count_airbnb    = sum(!is.na(price)),                     # count non-NA price rows
    mean_price      = mean(price, na.rm = TRUE),              # mean price
    mean_review     = mean(number_of_reviews, na.rm = TRUE),  # mean number of reviews
    mean_min_nights = mean(minimum_nights, na.rm = TRUE),     # mean minimum-nights stay
    Median_HhI      = mean(estimate_MedianHhI, na.rm = TRUE), # Median Household Income
    Internet_Use    = mean(estimate_InternetUse, na.rm = TRUE) # Internet Usage
  ) %>%
  filter(count_airbnb != 0)

overlay_listings_zip <- overlay_listings_zip %>%
  left_join(vc_califorina, by = "ZIP") # join the light-duty vehicle counts

overlay_listings_zip <- overlay_listings_zip %>%
  left_join(electric_u, by = "ZIP")    # join the commercial electricity rates

overlay_listings_zip <- na.omit(overlay_listings_zip) # remove NA values
We create an 8-tuple vector for the multi-variate classification, then subset the overlay listings shapefile without its geometry attribute:
Code for sub-setting attribute information from shapefile
overlay_listings_zip_NoGeo <- st_drop_geometry(overlay_listings_zip[, airbnb_ra, drop = FALSE]) # drop the geometry column from the shapefile
overlay_listings_zip_NoGeo <- overlay_listings_zip_NoGeo[complete.cases(overlay_listings_zip_NoGeo), ] # remove NA values
K-Means Clustering:
set.seed(12345) # set seed for reproducibility
k6cls <- kmeans(overlay_listings_zip_NoGeo, centers = 6, iter.max = 1000) # k-means clustering with 6 clusters (each observation is an 8-tuple)
overlay_listings_zip$k6cls <- k6cls$cluster # assign the cluster labels back to the overlay listings
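The choice of k = 6 can be sanity-checked with an elbow plot of the total within-cluster sum of squares. A sketch, noting as an aside that k-means is scale-sensitive, so standardising the 8 variables with scale() first is often advisable (the clustering above runs on raw values):

```r
set.seed(12345)
scaled_vars <- scale(overlay_listings_zip_NoGeo) # standardise so no single variable dominates the distance metric

# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(scaled_vars, centers = k, iter.max = 1000, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

A visible bend ("elbow") in the curve near the chosen k supports the 6-cluster solution; a bend elsewhere would suggest revisiting it.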
k6cls$centers # centres of the 6 clusters: six 8-tuple vectors, one defining the centre of each cluster
By observing the centers of the different clusters, the main types of neighborhoods can be identified: tourist-focused areas, residential areas, economically diverse areas and vehicle-dependent areas. Tourist-focused areas show a high Airbnb count, high internet usage and low median household income. Residential areas show high median household income, high internet usage and a high number of light-duty vehicles. Economically diverse areas show medium median household income (Median_HhI) and internet usage (Internet_Use). Vehicle-dependent areas show a high vehicle count.
The characteristics that help delineate the typology are the Airbnb metrics (‘count_airbnb’, ‘mean_price’, ‘mean_review’, ‘mean_min_nights’), Median Household Income, Internet Usage (People), the number of light-duty vehicles and the average commercial electricity rate.
Using this classification, we can target the areas most in need. Economically weak neighborhoods, identified by low median household income and internet usage, can be prioritised for development programs and infrastructure improvements. For managing tourist activity in neighborhoods with a high Airbnb count, urban planning committees can focus on minimising the impact of tourism on local communities, for example through affordable housing. For transportation and urban planning, neighborhoods with a high vehicle count might benefit from improved public transportation and the development of local amenities to reduce travel. Community services such as digital hubs might improve overall livelihoods in neighborhoods with low internet usage and varied median household income.
Code for creating Geo-demographic clustering map
tmap_mode("plot")

map_cluster1 <- tm_basemap() +
  tm_shape(overlay_listings_zip) +
  tm_fill(col = "k6cls", palette = viridis(256), style = "cont", title = "Type of Cluster") +
  tm_compass(position = c(0.85, 0.88)) +
  tm_scale_bar(position = c(0.58, 0.02)) +
  tm_borders(col = "white", lwd = 0.2) +
  tm_layout(legend.title.size = 0.7, legend.text.size = 0.6,
            inner.margins = c(0.2, 0.13, 0.02, 0.08),
            legend.position = c(0.05, 0.02), legend.width = 0.5, bg.color = "#eaf5fa",
            main.title = "Type of Clusters in Bay Area", # map title within tm_layout
            main.title.size = 0.7, main.title.position = "center")

map_veh <- tm_basemap() +
  tm_shape(overlay_listings_zip) +
  tm_fill(col = "No_of_Light_Duty_Vehicles", palette = "YlGn", n = 5, style = "fisher",
          title = "Vehicles per ZIP") +
  tm_compass(position = c(0.85, 0.88)) +
  tm_scale_bar(position = c(0.58, 0.02)) +
  tm_borders(col = "black", lwd = 0.2) +
  tm_layout(legend.title.size = 0.7, legend.text.size = 0.6,
            inner.margins = c(0.185, 0.13, 0.02, 0.08),
            legend.position = c(0.05, 0.02), legend.width = 0.5, bg.color = "#eaf5fa",
            main.title = "No. of Vehicles (Light-Duty) in Bay Area",
            main.title.size = 0.7, main.title.position = "center")

map_electric <- tm_basemap() +
  tm_shape(overlay_listings_zip) +
  tm_fill(col = "Combined_Avg_Comm_Rate", palette = "YlGn", n = 5, style = "fisher",
          title = "Commercial Electricity Rate") +
  tm_compass(position = c(0.85, 0.88)) +
  tm_scale_bar(position = c(0.58, 0.02)) +
  tm_borders(col = "black", lwd = 0.2) +
  tm_layout(legend.title.size = 0.7, legend.text.size = 0.6,
            inner.margins = c(0.185, 0.13, 0.02, 0.08),
            legend.position = c(0.05, 0.02), legend.width = 0.5, bg.color = "#eaf5fa",
            main.title = "Commercial Electricity Rate in Bay Area",
            main.title.size = 0.7, main.title.position = "center")
Geodemographic classification supports data-informed decisions in policy making and societal improvement. In reality the picture is more nuanced: it also requires an understanding of neighborhood dynamics, real-estate markets, and other socio-economic factors. Still, by targeting areas based on neighborhood characteristics, policymakers can address local issues more effectively and enhance overall community well-being.
Although multivariate clustering was performed, we neglected the time dimension, which introduces temporal inconsistencies. New forms of spatial data such as Airbnb listings can also introduce biases. Beyond the insights of this geodemographic classification, well-informed policy decisions therefore call for more sophisticated approaches and real-time verification procedures.
Note:
I apologize for any references I may have inadvertently forgotten to mention in this case study. All my work is built upon the invaluable contributions of the giants in the data science field, on whose shoulders I stand. I am deeply grateful for their insights and advancements, which have made this study possible.
---title: "Exploring Various Datasets of Bay Area and Finding Useful Insights for Policymakers"author: "Jay T. Pentapati"date: "10 Jan, 2024"image: san.jpgdescription: "Interactive Maps and Geodemographic Clustering with K-Means classfication method."categories: - R - Spatial Analysis - AIRBNB - Data Cleaning - RESTFul APItoc: trueformat: html: theme: light: flatly dark: darkly code-tools: true html-math-method: katex self-contained: true embed-resources: true number-sections: true number-depth: 6editor: visualexecute: warning: false---```{=html}<button style="position: fixed; top: 120px; left: 150px; " onclick="goBack()">Go Back</button><script>function goBack() { window.location.href = 'CaseStudies.html';}</script>```::: {style="margin-bottom: 40px;"}:::::: callout-tip## Note:Please click 'Go Back' Button for Central Page:::## Installing Packages {style="color: black"}Below code will check whether packages listed within variable (packages_required) are installed and not. If not installed, It will automatically installs and loaded all required packages for the Analysis.```{r packages_installation, echo=TRUE, results='hide'}#| code-fold: true#| code-summary: "Code for Installing all required R packages."packages_required <- c("tidycensus", "tigris","tidyverse", "data.table","sf", "tmap", "tmaptools","readr", "geojsonsf", "osmdata","devtools", "RColorBrewer","classInt","R.utils","dplyr", "ggplot2", "viridis","raster","terra","exactextractr", "tidyterra", "tibble", "rosm", "spdep", "cluster", "GGally", "rgeoda", "tidyr", "gridExtra","patchwork") # Include any additional packages needed, in above variable .new_req_packages <- packages_required[!(packages_required %in% installed.packages()[,"Package"])] # checking which packages are already installed and filtering out installed packages, new variable is assigned for names of uninstalled packages.if(length(new_req_packages)) install.packages(new_req_packages) # packages(not installed yet) are being installed. 
for(i in 1:length(packages_required)) { library(packages_required[i], character.only = T) # loading all packages named in packages_required}# Below message is differently coded output for checking installed packages, not the output of above line``````{r message_1, echo=FALSE, results='hide'}# Checking if all required packages are installed and loadedif(all(packages_required %in% installed.packages()[,"Package"])) { message_1 <- "All required packages for Assignment-2 are installed and loaded properly."} else { message_1 <- "Some packages could not be installed or loaded."}```::: {style="margin-bottom: 20px;"}```{r ou, echo=FALSE, results='asis'}print(message_1)```:::## Introduction### Rationale for CRSAccording to USGS Website, all maps in USA uses some projection in the State Plane Coordinate System (United States Geological Survey, 2023). It is plane coordinate system in which each state is divided up to six zones, depending on geometry of the state (United States Geological Survey, 2023). These zones use a Mercator or Lambert conic projection which preserves the shape and scale over smaller areas like zcta's or tracts. By looking into Map of USA State Plane Zones NAD83 in ArcGIS website (ArcGIS Hub, 2020), we can identify that Bay Area is in California Zone 3. Therefore, we are going to use *EPSG: 2227 -- NAD 83 California zone 3* (ftUS). ### Description of DataAirbnb Listings Data of San Francisco, Oakland and San Mateo are extracted from Inside Airbnb Website. Chosen Variables from the list of socio-economic metrics of the ACS are Median Household Income (B19013_001E) and No. Of People aged 18-64 who has computer and Broadband Internet Subscription (B28005_011E). We also use Bay Area Zip Code Areas.For additional Datasets, we are going to use No. of Vehicles (light-duty) 2020 per zip code and Average Electricity Utility Rate 2020 (Commerical Rate) per zip code, both investors owned and non-investor owned utilities. 
```{r J_Key, include=FALSE}J_API_KEY <- "537cf6fc34a125e6b1601e06e86958079fc610c8" # this is api key, specific to each individual```Note: API Key can be obtained by signing up at <http://api.census.gov/data/key_signup.html> for Census Data. We, then use tidycensus package to get ACS data from US Census Bureau.```{r setup_api_a, echo=TRUE, results='hide'}#| code-fold: true#| code-summary: "Code for setting up API Key"census_api_key(J_API_KEY , install = TRUE, overwrite = TRUE) # Please replace 'J_API_KEY' with API Key obtained from census bureau# Below message is differently coded output for checking system environment variable(CENSUS_API_KEY), not the output of above line``````{r setup_api_message, echo=FALSE, results='hide'}options(tigris_use_cache = TRUE) # setting this option, helps in a way that download shapefiles will be in our Local R environment.# Checking if the API key is successfully set upif (nzchar(Sys.getenv("CENSUS_API_KEY"))) { message_text <- "API key to get Census data is successfully set up."} else { message_text <- "Census API key setup failed."}``````{r message_text, echo=FALSE, results='asis'}print(message_text)```### Data Extraction and Pre-ProcessingWe are making two API calls to retrieve Median Household Income (B19013_001E) and the No. of people aged 18-64 who have a computer and Broadband Internet Subscription (B28005_011E) from ACS 2016-2020 separately. This approach is more efficient and straight-forward than making a single API call with two variables. The latter can results in a spatial or non-spatial file with a long-format attribute table, which, in turn, needs to be converted into a wide-format table for analysis.We are extracting values of two variables for entire US. Then Later, we filter out zip codes specific to Bay Area. 
This is because, some califorina zcta's extends into other states, Which results in error while extracting zcta's related data for specific states ('zip-Determining which US zipcodes map to more than one state or more than one city?-Geographic Information Systems', 2021). According to US Census Bureau, All ZCTA's codes are valid US Zip Codes(United States Census Bureau, 2023).- Spatial Resolution: zcta's level- Time: 2020```{r us_zip_data, echo=TRUE, results='hide'}#| code-fold: true#| code-summary: "Cleaning and pre-processing Data"median_hh_income <- get_acs( # get_acs is a function, we are using to get ACS Suvrey data from US Census Bureau geography = "zcta", # we are specifying geography unit for variable we getting, like, ZCTA, Tract, County, State etc. variables = "B19013_001E", # we are geeting MEDIAN HOUSEHOLD INCOME, "B19013_001E" is variable name in ACS year = 2020, # we are getting census data collected from 2016-2020 geometry = FALSE # we use geometry of bay area zipcode shapefile from berkeley library, not ACS Data) %>% rename(estimate_MedianHhI = estimate) %>% # we are renaming retrieved variable(estimate to appropriate name ( estimate_Median_HhI) select(GEOID, estimate_MedianHhI) # we are selecting only GEOID(ZCTA's CODE, also ZIP CODE), estimate Value (ignoring MOE)# Retrieving Internet Use Variableinternet_use <- get_acs( # get_acs is a function, we are using to get ACS Suvrey data from US Census Bureau geography = "zcta", # we are specifying geography unit for variable we getting, like, ZCTA, Tract, County, State etc. 
variables = "B28005_011E", # we are geeting No.of People aged 18-64, who has computer and Broadband Internet Subscription, "B28005_011E" is variable name in ACS year = 2020, # we are getting census data collected from 2016-2020 geometry = FALSE # we use geometry of bay area zipcode shapefile from berkeley library, not ACS Data) %>% rename(estimate_InternetUse = estimate) %>% # we are renaming retrieved variable(estimate to appropriate name ( estimate_Internet_Use) select(GEOID, estimate_InternetUse) # we are selecting only GEOID(ZCTA's CODE, also ZIP CODE), estimate Value (ignoring MOE)us_zip <- left_join(median_hh_income, internet_use, by = "GEOID") # simply join two datasets by GEOID to get Dataset in req format ``````{r show_us_zip_data}#| code-fold: true#| code-summary: "Code for Showing first six rows of Data being used"head(us_zip) #showing first six rows of Dataset```We downloaded the Bay Area listings data (San Francisco, Oakland and San Mateo) from Airbnb Website and Bay Area Zip-Code Areas from Berkeley Library. 
Loading all .csv files by using `read.csv` function from `utils` package (Loading process can be found in source code of this file).```{r essential_data, include=FALSE}# Loading airbnb Listings downloaded from AIRBNB website for areas(san francisco, oakland and san mateo)listings_oakland <- read.csv("/home/jay/Documents/ENVS563-Geographic_Data_Science/Assignment-2/data/tables/Oakland/listings.csv") listings_sanfrancisco <- read.csv("/home/jay/Documents/ENVS563-Geographic_Data_Science/Assignment-2/data/tables/SanFrancisco/listings.csv")listings_sanmateo <- read.csv("/home/jay/Documents/ENVS563-Geographic_Data_Science/Assignment-2/data/tables/SanMateo/listings.csv")# Row Binding all three datasets to produce specified bay area's Airbnb Listingslistings_bayarea <- rbind(listings_sanfrancisco, listings_sanmateo, listings_oakland)# Loading Bay area ZCTA's or ZIPCODE Areas for geometry which are downloaded from Berkeley Librarybayarea_zipcodes <- read_sf("/home/jay/Documents/ENVS563-Geographic_Data_Science/Assignment-2/data/bayarea_zipcodes/bayarea_zipcodes.shp")```Two Additional Datasets:We will be using Commerical Electricity Rates Dataset for Both investor-owned and non-investor owned utilities per ZIP Code.```{r additional_data, include=FALSE}# We are using Commerical Electricity Rate Dataset for Both investor-owned and non-investor owned utilities per ZIP Codeiou_data <- read.csv("/home/jay/Documents/ENVS563-Geographic_Data_Science/Assignment-2/data/tables/iou_zipcodes_2020.csv")non_iou_data <- read.csv("/home/jay/Documents/ENVS563-Geographic_Data_Science/Assignment-2/data/tables/non_iou_zipcodes_2020.csv")# We are using No.of Vehicles per ZIP Code datasetvehicle_count_2020 <- read.csv("/home/jay/Documents/ENVS563-Geographic_Data_Science/Assignment-2/data/tables/vehicle-count-2020.csv")``````{r add_data_cleaning, echo=TRUE, results='hide'}#| code-fold: true#| code-summary: "Code for Data cleaning and Pre-processing of Electricity rates data"non_iou_data_c <- 
non_iou_data %>% # Data cleaning process for additional datasets, select(zip, comm_rate) %>% # we select only zip, commerical rate from non-investor based utilities group_by(zip) %>% # we group them by ZIP summarise( Avg_Non_IOU_Comm_Rate = mean(comm_rate, na.rm = TRUE) # Mean of different Non-Investor Based utilities per ZIP )iou_data_c <- iou_data %>% select(zip, comm_rate) %>% # we select only zip, commerical rate from Investor based utilities group_by(zip) %>% # we group them by ZIP summarise( Avg_IOU_Comm_Rate = mean(comm_rate, na.rm = TRUE) # Mean of different Investor Based utilities per ZIP )electric_u <- non_iou_data_c %>% full_join(iou_data_c, by = "zip")%>% # Join both Investor-owned and Non-Investor utility datasets by zipcode mutate( Avg_IOU_Comm_Rate = replace_na(Avg_IOU_Comm_Rate, 0), # replacing any NA values with 0 Avg_Non_IOU_Comm_Rate = replace_na(Avg_Non_IOU_Comm_Rate, 0 ), # replacing any NA values with 0 Combined_Avg_Comm_Rate = (Avg_Non_IOU_Comm_Rate + Avg_IOU_Comm_Rate ) / 2 # using average as statistic for combined Commerical electricity Rate )%>% select(zip, Combined_Avg_Comm_Rate) %>% rename(ZIP = zip) %>% mutate(ZIP = as.character(ZIP)) # selecting only ZIP and New Statistic, ignoring restvc_califorina <- vehicle_count_2020 %>% select(-Date, -Model.Year, -Fuel, -Make) %>% # we are only selecting Zip.code, Vehicles and Duty cols, and ignoring rest filter(Duty == "Light" & Zip.Code != "OOS") %>% # we are filter only Light-Duty vehicles and removing Out of Service(OOS) locations from data group_by(Zip.Code) %>% # group by ZIP Code summarise(No_of_Light_Duty_Vehicles = sum(Vehicles)) %>% rename( ZIP = Zip.Code) # summing up Light-Duty Vehicles per ZIP and then rename it```::: {style="margin-bottom: 20px;"}:::## Methodology### No. of AIRBNB Listings with Mean Price per ZIP Code AreasBay Area Airbnb's Listings, We downloaded have Longitude and Longitude coordinates, So initially, We set CRS of Listings to *EPSG: 4326 - WGS84*. 
Then, We perform CRS re-projection into *EPSG: 2227 -- NAD 83 California zone 3* (ft-US).We then, perform spatial join for Bay Area ZIP Code Areas and AIRBNB Listings, which results in Overlay Shape-file. This will enable us to understand geographical distribution of AIRBNB Listings and their charactertics over Bay Area ZIP Code Areas.```{r shpf, echo=TRUE, results='asis'}#| code-fold: true#| code-summary: "Code for pre-processing of Shapefiles"listings_bayarea_sf <- listings_bayarea %>% st_as_sf(coords = c("longitude", "latitude")) %>% # creating point shapefile from coordinates st_set_crs(4326) # setting CRS to geodetic system of earth (4326, WGS84) as they are long, lat coordslistings_bayarea_sf_transform <- st_transform(listings_bayarea_sf, crs = 2227) # Now, Changing CRS to mentioned Project CRSoverlay_listings <- st_join(bayarea_zipcodes, listings_bayarea_sf_transform) # Performing Spatial Join to create overlay Shapefile of Airbnb Listings and Bay area ZIP shapefilest_crs(overlay_listings) == st_crs(listings_bayarea_sf_transform) # Verifying CRS of New shapefile```We need to plot No. of Airbnb's and their mean price per zipcode in Bay Area. To find out necessary variables, we group by Zip Code and count no. 
of non-NA price rows of listings in each zip code and use function(mean) to find mean price of those listings.```{r vars, echo=TRUE, results='hide'}listings_bayarea_zip <- overlay_listings %>% # creating new dataframe with no.of Airbnb's per ZIP and Mean Price per ZIP group_by(ZIP) %>% # group by ZIP summarise( count_airbnb = sum(!is.na(price)), # counting non-NA price rows for listings mean_price = mean(price, na.rm = TRUE) # calculate mean, ignoring NAs )%>% filter(count_airbnb != 0) ``````{r}listings_bayarea_zip1 <- listings_bayarea_zip %>%select(count_airbnb, everything()) # Changing the order of attributes for interactive maplistings_bayarea_zip2 <- listings_bayarea_zip %>%select(mean_price, everything())%>%# Changing the order of attributes for interactive mapmutate(mean_price =round(mean_price, 1)) # rounding mean price to one decimal place.``````{r summary_vars, echo=FALSE, results='hide' }cat("Summary of Airbnb's Count per Zipcode Variable(count_airbnb): ")cat("\n")summary(listings_bayarea_zip1$count_airbnb)cat("\nSummary of Mean Price per Zipcode Variable(mean_price): ")cat("\n")summary(listings_bayarea_zip1$mean_price)cat("\n")# Calculate standard deviationstd_dev_count_airbnb <- sd(listings_bayarea_zip1$count_airbnb, na.rm = TRUE)std_dev_mean_price<- sd(listings_bayarea_zip1$mean_price, na.rm = TRUE)# Display the resultcat("Standard Deviation of Variable(count_airbnb):", std_dev_count_airbnb, "\n")cat("Standard Deviation of Variable(mean_price): ", std_dev_mean_price, "\n")``````{r adj, echo=FALSE, results='hide'}center_lat <- 37.703253 # Replace with the latitude of your area's centercenter_lon <- -122.310913 # Replace with the longitude of your area's centerzoom_level <- 10 ```Based on standard statistics for both Count and Mean Price variables of AIRBNB's per zip code, Minimum value of count variable is 1 and first quartile is 71, so first class is divided accordingly. 
Median of count variable is 172 and standard deviation is 162.5, so, second and third class is divided as defined. Finally to identify outliers, As Third quartile is 257, standard deviation is 162.5 and Maximum Value is 794, last class has 500 as its lower boundary.Minimum value of Mean Price variable is 71 and first quartile is 154, so first class is divided accordingly. Median of Mean Price is 204 and standard deviation is 297.9, so, second and third class is divided as defined. Finally to identify outliers, As Third quartile is 282.6, standard deviation is 297.9 and Maximum Value is 1818, last two classes have been divided 1000 and 1800 as their lower boundaries respectively.New forms of spatial data like Airbnb data can offer latest snaphots of how people are are using spaces and leads to real-time insights. There will generally have high Heteroskedasticity and also provide detailed data at granular level such as individual Airbnbs. On other hand, New form of spatial data like Airbnb data never represent all segments of markets. Collection of Data can introduce measurement errors due to less standard methodologies.Airbnb data is dynamic and reflects real-time changes in market, like social media or GPS data. Like social Media or GPS data, Airbnb data is user-generated. This is quite different from census-type of data.```{r map_1, echo=TRUE, results='asis'}#| code-fold: true#| code-summary: "Code for Maps of Count and Price of AIRBNBs per ZIP"tmap_mode("view") # Interactive modemap1.11 <- tm_shape(listings_bayarea_zip1) + # listings_bayarea_zip1 is just re-arrangement of listings_bayareaa_zip with first column as count_airbnb tm_fill( "count_airbnb",style="fixed", title = "No. 
of Airbnb's per ZIP", alpha = 1, breaks=c( 1, 72, 172, 359, 500, Inf), palette = "-viridis")+ tm_borders()+ tm_text("ZIP", size = 0.3)+ tm_view(set.view = c(center_lon, center_lat, zoom_level))map1.12 <- tm_shape(listings_bayarea_zip2) + # listings_bayarea_zip2 is just re-arrangement of listings_bayareaa_zip with first column as Mean Price tm_fill( "mean_price",style="fixed", title = "Mean Price of Airbnb per ZIP", alpha = 1, breaks=c( 74, 155, 287, 501,1000, 1800, Inf), palette = "-viridis")+ tm_borders()+ tm_text("ZIP", size = 0.3)+ tm_view(set.view = c(center_lon, center_lat, zoom_level))tmap_arrange(map1.11, map1.12)```::: {style="margin-bottom: 40px;"}:::### Median Household Income and No.of People aged 18-64 who has Computer and Internet Subscription per ZIP Code AreasWe filter Median Household Income and Internet Use Variable for Bay Area Zip Codes. Then we join ACS Dataset to Bay Area ZIP Code Areas shapefile. As an additional step, we remove all ZIP codes where Internet Use variable is missing.```{r bayarea_zip, echo=TRUE, results='hide'}#| code-fold: true#| code-summary: "Code for Cleaning Attribute information in Shapefile"bay_area_zipcodes <- bayarea_zipcodes$ZIP # Creating new var with Bayarea ZIP codesbayarea_zip_vars_data <- us_zip %>% # Filtering ACS Data with Two variables, by Bay Area ZIP codes filter(GEOID %in% bay_area_zipcodes)%>% # Using Bay Area ZIP codes variable, created above rename(ZIP = GEOID) # renaming GEOID with ZIPbayarea_zip_vars <- bayarea_zipcodes %>% left_join(bayarea_zip_vars_data, by = "ZIP") %>% filter(!(is.na(estimate_InternetUse))) # Joining two vars from ACS with bayarea_zipcodes shapefile which is downloaded from berkeley library and also removing NA values of Internet Use variable.``````{r bayarea_show, echo=TRUE}head(bayarea_zip_vars)```We remove all ZIP codes where Median Household Income variable is missing.```{r bayarea_filter, echo=FALSE, results='hide'}bayarea_zip_vars1 <- bayarea_zip_vars %>% 
filter(!(is.na(estimate_MedianHhI))) # Removing rows of bayarea_zip_vars where 'estimate_MedianHhI' is NA``````{r sum_acs_var, echo=FALSE}cat("Summary of Median Household Income per Zipcode (estimate_MedianHhI): ")cat("\n")summary(bayarea_zip_vars1$estimate_MedianHhI) # Different Standard statistics for estimate_MedianHhIcat("\n")cat("\nSummary of No.of People aged 18-64 who has Computer and Internet Subscription ")cat("per Zipcode (estimate_InternetUse): ")cat("\n")summary(bayarea_zip_vars$estimate_InternetUse) # Different Standard statistics for estimate_MedianHhIcat("\n")# Calculate standard deviationstd_dev_MedianHhI<- sd(bayarea_zip_vars1$estimate_MedianHhI, na.rm = TRUE)std_dev_iu<- sd(bayarea_zip_vars$estimate_InternetUse, na.rm = TRUE)# Display the resultcat("Standard Deviation of Variable(estimate_MedianHhI): ", std_dev_MedianHhI, "\n")cat("\n")cat("Standard Deviation of Variable(estimate_InternetUse): ", std_dev_iu, "\n")```Based on standard statistics for both variables of ACS per zip code, Range and Standard Deviation for both variables are large, We use Fisher-Jenks Data Classification method, because of highly-skewed data.Bay Area ZIP Code Areas is Mixed-Use Type of Neighborhood. Most of neighborhoods have medium Household Incomes and High Internet Subscriptions. Socio-Economic Indicators like Median Household Income and Internet Use helps in delineating this typology of Neighborhood. San Francisco and Oakland have more-affulent neighborhoods due to High Internet Usage and medium Household Income, whereas San Mateo have High Household Income Neighborhoods and Medium Internet Usage.Based on the given Classification, Reasonable Hypothesis will be that AIRBNB would cluster in San Francisco and Oakland, due to High Internet Usage and Medium Household Income (Middle to High-Class Residential Areas). 
AIRBNB listings would also cluster in Areas with low commercial electricity rate and High No.of Vehicles.::: {style="overflow: auto; max-height: 700px;"}```{r map2_code, echo=TRUE, results='hide'}#| code-fold: true#| code-summary: "Code for Maps of Median Household Income and Internet Use per ZIP"tmap_mode("plot")map2.1 <- tm_basemap() + tm_shape(bayarea_zip_vars1) + # shapefile for map 2 tm_fill(col = "estimate_MedianHhI", palette = "YlGn", n = 5, style = "fisher", title = "Median Household Income (in $ USD)") + # estimate_MedianHhI tm_compass(position = c(0.85, 0.85)) + # Adding Compass tm_scale_bar(position = c(0.625, 0.02)) + # Adding Scalebar tm_borders(lwd = 0.3)+ # Borders for zcta's polygons tm_layout( legend.title.size = 0.715, legend.text.size = 0.6, inner.margins = c(0.22, 0.13, 0.02, 0.08), legend.position = c(0.05, 0.02), legend.width = 0.5, bg.color = "#eaf5fa", main.title = "Median Household Income per ZIP Code", # Add the map title within tm_layout main.title.size = 0.7, main.title.position = "center" )map2.2 <- tm_basemap() + tm_shape(bayarea_zip_vars) + tm_fill(col = "estimate_InternetUse", palette = "YlGn", n = 5, style = "fisher", title = "Broadband Internet Connection (People)") + tm_compass(position = c(0.85, 0.85)) + tm_scale_bar(position = c(0.625, 0.02)) + tm_borders(lwd = 0.3)+ tm_layout( legend.title.size = 0.715, legend.text.size = 0.6, inner.margins = c(0.185, 0.13, 0.02, 0.08), legend.position = c(0.05, 0.02), legend.width = 0.5, bg.color = "#eaf5fa", main.title = "No. 
of People aged (18-64) using Broadband Internet per ZIP Code", # Add the map title within tm_layout main.title.size = 0.7, main.title.position = "center" )```:::```{r map2, echo=TRUE, fig.align='center'}tmap_arrange(map2.1, map2.2)```### MAP 3: AIRBNB Listings and Internet Use Variable (Combining Datasets)We change the order of attributes in Bay Area ZIP Codes as following:```{r bayarea_map3, echo=FALSE, results='hide'}bayarea_zip_vars_map3 <- bayarea_zip_vars %>% select(estimate_InternetUse, everything()) # Changing the order of attributes for interactive map```We calculate Natural Logarithm of Price of Airbnb Listings in Overlay Shapefile and Checking CRS of both shapefiles:```{r airbnb_listings, echo=TRUE, results='asis'}#| code-fold: true#| code-summary: "Code for Calculating Log prices of AIRBNBs"listings_bayarea_sf_transform_copy <- listings_bayarea_sf_transformlistings_bayarea_sf_transform_copy$nl_price <- log(listings_bayarea_sf_transform_copy$price) # calculating natural logarithm of price ( log to base e)st_crs(bayarea_zip_vars_map3) == st_crs(listings_bayarea_sf_transform_copy) # checking CRS of both shapefiles```We are changing the order of attribute in Overlay Listings Shapefile as following:```{r map3_code, echo=FALSE, results='hide'}listings_bayarea_sf_transform_copy <- listings_bayarea_sf_transform_copy %>% select(nl_price, everything()) %>% mutate(nl_price = round(nl_price, 1))```Based on standard statistics for Internet Usage(People) per zip code and Natural Logarithm of AIRBNB Price, Range and Standard Deviation are large, We use Fisher-Jenks Data Classification method, because of highly-skewed data.AIRBNB Listings are majorly clustered in San Francisco, Daly City, South San Francisco, Burlingame, Foster City, Redwood City and Oakland areas(geographical observations from below map). 
High Natural Logarithm of Prices are found in San Francisco, West Coast Seaside and Redwood City(San Carlos).Apparently in San Francisco, ZIP Code areas with High Internet Usage have High AIRBNB Listings. ZIP Code Areas with 5,500 people using internet, have high price for AIRBNB Listings.```{r map3, echo=TRUE, results='asis'}#| code-fold: true#| code-summary: "Code for Maps of Internet Use per ZIP in San Francisco"tmap_mode("view") # interactive mode activatingmap3 <- tm_shape(bayarea_zip_vars_map3) + tm_fill(col = "estimate_InternetUse", alpha = 1, palette = "YlGn", n = 5, style = "fisher",title = "Broadband Internet Connection(People)") + tm_borders() + tm_text("ZIP", size = 0.3)+ tm_shape(listings_bayarea_sf_transform_copy) + tm_dots(col = "nl_price", palette = "Reds", n = 3, style = "fisher", title = "Natural Logarithm of Airbnb's Price") + tm_view(set.view = c(center_lon, center_lat, zoom_level))# Set legend position for interactive viewmap3```### MAP 4: Spatial Auto-CorrelationCalculating list of neighbors(Neighborhoods) and Queen-based Spatial Weights Matrix for Bay Area ZIP Code areas which have Median Household Income:```{r spatial_bayarea, echo=TRUE, results='hide'}#| code-fold: true#| code-summary: "Code for calculating spatial weights martrix for ZIP codes which have Median Household Income"nb_q1 <- poly2nb(bayarea_zip_vars1, queen = TRUE) # shapefile used, have no NA values of Median Household Income variablew_queen1 <- nb2listw(nb_q1, style = "B", zero.policy=TRUE) # queen-contiguity based Weights for list of neighbors for all observations isolates1 <- which(w_queen1$neighbours == "0")bayarea_zip_vars1_NoZero <- bayarea_zip_vars1[-c(isolates1),] # shapefile where there are no observation with zero neighbours based on queen-contiguity```Calculating list of neighbors(Neighborhoods) and Queen-based Spatial Weights Matrix for Bay Area ZIP Code areas which have Internet Usage:```{r spatial_bay2, echo=TRUE, results='hide'}#| code-fold: true#| code-summary: 
"Code for calculating spatial weights martrix for ZIP codes which have Internet Use"nb_q2 <- poly2nb(bayarea_zip_vars, queen = TRUE) # shapefile used, have no NA values of Broadband Internet Usew_queen2 <- nb2listw(nb_q2, style = "B", zero.policy=TRUE)isolates2 <- which(w_queen2$neighbours == "0")bayarea_zip_vars_NoZero <- bayarea_zip_vars[-c(isolates2),] # shapefile where there are no observation with zero neighbours based on queen-contiguity```#### Global Spatial Auto-Correlation:Calculating list of neighbors(Neighborhoods) and Queen-based Standardized Spatial Weights Matrix for Bay Area ZIP Code areas which have Median Household Income or/and Internet Usage(People) :```{r spatial_update, echo=TRUE, results='hide'}#| code-fold: true#| code-summary: "Code for calculating spatial weights martrix for ZIP codes which have Median Household Income or Internet Use"nb_q1 <- poly2nb(bayarea_zip_vars1_NoZero, queen = TRUE) # Constructing neighbours list from filtering observations which have no zero neighbours and no NA values of Median Household Incomew_queen_std1 <- nb2listw(nb_q1, style = "W") # creating spatial weights matrix using queen contiguity and row-standardardised weightsnb_q2 <- poly2nb(bayarea_zip_vars_NoZero, queen = TRUE) # Constructing neighbours list from filtering out observations which have zero neighbours and no NA values of Broadband Internet Usew_queen_std2 <- nb2listw(nb_q2, style = "W") # creating spatial weights matrix using queen contiguity and row-standardardised weights```Calculating Spatial Lag for Median Household Income and Internet Usage(People):```{r spatial_lag, echo=TRUE, results='hide'}#| code-fold: true#| code-summary: "Code for calculating spatial lag for ZIP codes"bayarea_zip_vars1_NoZero$sl_MedianHhI <- lag.listw(w_queen_std1, bayarea_zip_vars1_NoZero$estimate_MedianHhI) # calculating spatial lag for Medain Household Income Variable in Bay Area bayarea_zip_vars_NoZero$sl_InternetUse <- lag.listw(w_queen_std2, 
bayarea_zip_vars_NoZero$estimate_InternetUse) # calculating spatial lag for Internet Use Variable in Bay Area ``````{r spatial_lag_std, echo=FALSE, results='hide'}bayarea_zip_vars1_NoZero$MedianHhI_std <- (bayarea_zip_vars1_NoZero$estimate_MedianHhI - mean(bayarea_zip_vars1_NoZero$estimate_MedianHhI))/sd(bayarea_zip_vars1_NoZero$estimate_MedianHhI) # Spatial Lag for Median Household Income using standardized weights bayarea_zip_vars1_NoZero$sl_Median_HhI_std <- lag.listw(w_queen_std1, bayarea_zip_vars1_NoZero$MedianHhI_std)bayarea_zip_vars_NoZero$InternetUse_std <- (bayarea_zip_vars_NoZero$estimate_InternetUse - mean(bayarea_zip_vars_NoZero$estimate_InternetUse))/sd(bayarea_zip_vars_NoZero$estimate_InternetUse)# Spatial Lag for Internet Use variable using standardized weights bayarea_zip_vars_NoZero$sl_InternetUse_std <- lag.listw(w_queen_std2, bayarea_zip_vars_NoZero$InternetUse_std)```Moran I's for Median Household Income in Bay Area ZIP Code Areas:```{r moran_median, echo=TRUE}#| code-fold: true#| code-summary: "Code for calculating Moran's I statitstic for Median Household Income"moran.mc(bayarea_zip_vars1_NoZero$estimate_MedianHhI, w_queen_std1, nsim=1000, alternative="greater") # Moran's I statistic for Median Household Income Variable in Bay Area```Moran I's for Median Household Income in Bay Area is 0.396 and p-value is 0.000999. So, Observed Spatial Pattern is statistically Significant. P-value means chances of obtaining Moran I's value greater than original, given random spatial distribution. If it is less than 0.05, We reject Null Hypothesis i.e, observed spatial pattern is due to random chance. Moran I's is analogous to Chi-Square Statistic. 
Spatial distribution of Median Household Income is spatially concentrated than random spatial distribution.Moran I's for Internet Usage(People) in Bay Area ZIP Code Areas:```{r moran_internet, echo=TRUE}#| code-fold: true#| code-summary: "Code for calculating Moran's I statitstic for Internet Use"moran.mc(bayarea_zip_vars_NoZero$estimate_InternetUse, w_queen_std2, nsim=1000, alternative="greater") # Moran's I statistic for Interent Use Variable in Bay Area```Moran I's for Internet Usage (People) in Bay Area is 0.23313 and p-value is 0.000999. So, Observed Spatial Pattern is statistically Significant. Spatial distribution of Internet Usage (People) is mild spatially concentrated than random spatial distribution.#### Local Spatial Auto-Correlation:Calculating LISAs for Bay Area ZIP COde Areas for both Median Household Income and Internet Usage(People):```{r lisa, echo=TRUE, results='hide'}#| code-fold: true#| code-summary: "Code for calculating LISA statistics"lisa_perm1 <- localmoran_perm(bayarea_zip_vars1_NoZero$estimate_MedianHhI, w_queen_std1, nsim=1000, alternative="two.sided") # Lisa Clustering for Median Household Income in Bay Arealisa_perm2 <- localmoran_perm(bayarea_zip_vars_NoZero$estimate_InternetUse, w_queen_std2, nsim=1000, alternative="two.sided") # Lisa Clustering for Internet Use Variable in Bay Area```Calculating Quadrants where local moran I's for observation, with significance cutoff at 0.2. 
As there were no quadrants at a significance level of 0.1, the cutoff is chosen so that at least one variable yields a minimal set of levels:

```{r lisa_q, echo=TRUE}
#| code-fold: true
#| code-summary: "Code for calculating LISA quadrants"
quadrants1 <- hotspot(lisa_perm1, Prname = "Pr(z != E(Ii)) Sim", cutoff = 0.2) # creating quadrants from the significance values of the local Moran's I statistic for Median Household Income
quadrants2 <- hotspot(lisa_perm2, Prname = "Pr(z != E(Ii)) Sim", cutoff = 0.2) # creating quadrants from the significance values of the local Moran's I statistic for the Internet Use variable
```

```{r qua_lev, echo=FALSE}
cat("Levels of Quadrant 1 :\n")
cat(levels(quadrants1), sep = ", ") # significant levels of quadrant 1
cat("\n\nLevels of Quadrant 2 :\n")
cat(levels(quadrants2), sep = ", ") # significant levels of quadrant 2
```

Adding the quadrants back to the Bay Area ZIP Code Areas shapefile:

```{r lisa_q_update, echo=TRUE, results='hide'}
#| code-fold: true
#| code-summary: "Code for adding new categorical variable back to shapefile"
bayarea_zip_vars1_NoZero$quadrant1 <- as.character(quadrants1) %>% replace_na("Not significant") # adding the quadrants to the shapefile (no NA values for Median Household Income); no observation has zero neighbours under queen contiguity
bayarea_zip_vars_NoZero$quadrant2 <- as.character(quadrants2) %>% replace_na("Not significant") # adding the quadrants to the shapefile (no NA values for the Internet Use variable); no observation has zero neighbours under queen contiguity
```

```{r lisa_borders, echo=FALSE, results='hide'}
borders1 <- tm_shape(bayarea_zip_vars1_NoZero) + tm_fill() + tm_borders(col = "black", lwd = 0.2)
borders2 <- tm_shape(bayarea_zip_vars_NoZero) + tm_fill() + tm_borders(col = "black", lwd = 0.2)
```

::: {style="overflow: auto; max-height: 400px;"}
```{r map_4_pre, echo=FALSE, results='hide'}
hh1 <- bayarea_zip_vars1_NoZero %>% dplyr::filter(quadrant1 == "High-High") # filtering only the 'High-High' quadrant to map
hh_map1 <- tm_shape(hh1) + tm_fill(col = "royalblue2", alpha = 0.8) + tm_borders(col = "black", lwd = 0.3)
ll1 <- bayarea_zip_vars1_NoZero %>% dplyr::filter(quadrant1 == "Low-Low") # filtering only the 'Low-Low' quadrant to map
ll_map1 <- tm_shape(ll1) + tm_fill(col = "red2", alpha = 0.8) + tm_borders(col = "black", lwd = 0.3)
lh1 <- bayarea_zip_vars1_NoZero %>% dplyr::filter(quadrant1 == "Low-High") # filtering only the 'Low-High' quadrant to map
lh_map1 <- tm_shape(lh1) + tm_fill(col = "gold", alpha = 0.8) + tm_borders(col = "black", lwd = 0.3)
hl1 <- bayarea_zip_vars1_NoZero %>% dplyr::filter(quadrant1 == "High-Low") # filtering only the 'High-Low' quadrant to map
hl_map1 <- tm_shape(hl1) + tm_fill(col = "darkgreen", alpha = 0.8) + tm_borders(col = "black", lwd = 0.3)
ns1 <- bayarea_zip_vars1_NoZero %>% dplyr::filter(quadrant1 == "Not significant") # filtering only the 'Not significant' quadrant to map
ns_map1 <- tm_shape(ns1) + tm_fill(col = "lightgrey", alpha = 0.8) + tm_borders(col = "black", lwd = 0.3)
hh2 <- bayarea_zip_vars_NoZero %>% dplyr::filter(quadrant2 == "High-High") # filtering only the 'High-High' quadrant to map
hh_map2 <- tm_shape(hh2) + tm_fill(col = "royalblue2", alpha = 0.8) + tm_borders(col = "black", lwd = 0.3)
ll2 <- bayarea_zip_vars_NoZero %>% dplyr::filter(quadrant2 == "Low-Low") # filtering only the 'Low-Low' quadrant to map
ll_map2 <- tm_shape(ll2) + tm_fill(col = "red2", alpha = 0.8) + tm_borders(col = "black", lwd = 0.3)
hl2 <- bayarea_zip_vars_NoZero %>% dplyr::filter(quadrant2 == "High-Low") # filtering only the 'High-Low' quadrant to map
hl_map2 <- tm_shape(hl2) + tm_fill(col = "darkgreen", alpha = 0.8) + tm_borders(col = "black", lwd = 0.3)
ns2 <- bayarea_zip_vars_NoZero %>% dplyr::filter(quadrant2 == "Not significant") # filtering only the 'Not significant' quadrant to map
ns_map2 <- tm_shape(ns2) + tm_fill(col = "lightgrey", alpha = 0.8) + tm_borders(col = "black", lwd = 0.3)
```
:::

::: {style="margin-top: 40px;"}
:::

To view the complete chunks, see the source code.

As there were no quadrants at a cutoff of 0.1, we set the cutoff at 0.2 to obtain a minimal number of quadrants for at least one variable. At this relaxed cutoff, the individual LISA clusters may well have arisen by random chance: the clustering at the whole-Bay-Area level has a statistically significant spatial structure (as the global Moran's I showed), but the spatial pattern within the LISA clusters is not statistically significant at conventional levels. Even so, it would be a stretch to dismiss the entire LISA pattern as pure chance; settling this would require further, more advanced analysis.

::: {style="overflow: auto; max-height: 700px;"}
```{r map_4, echo=TRUE, results='asis'}
#| code-fold: true
#| code-summary: "Code for creating LISA maps"
tmap_mode("plot")
lisa_map_cluster1 <- borders1 + hh_map1 + ll_map1 + hl_map1 + lh_map1 + ns_map1 + # combining all quadrant maps
  tm_compass(position = c(0.85, 0.85)) +
  tm_scale_bar(position = c(0.625, 0.02)) +
  tm_add_legend(type = "fill",
                col = c("royalblue2", "red2", "darkgreen", "gold", "lightgrey"),
                labels = c("High-High", "Low-Low", "High-Low", "Low-High", "Not significant"),
                title = "LISA cluster (Median Household Income)") +
  tm_layout(legend.title.size = 0.715, legend.text.size = 0.6,
            inner.margins = c(0.22, 0.13, 0.02, 0.08),
            legend.position = c(0.05, 0.02), legend.width = 0.5,
            bg.color = "#eaf5fa",
            main.title = "LISA Clusters for Median Household Income Variable",
            main.title.size = 0.7, main.title.position = "center")
lisa_map_cluster2 <- borders2 + hh_map2 + ll_map2 + hl_map2 + ns_map2 + # combining all quadrant maps
  tm_compass(position = c(0.85, 0.85)) +
  tm_scale_bar(position = c(0.625, 0.02)) +
  tm_add_legend(type = "fill",
                col = c("royalblue2", "red2", "darkgreen", "lightgrey"),
                labels = c("High-High", "Low-Low", "High-Low", "Not significant"),
                title = "LISA cluster (Broadband Internet Use)") +
  tm_layout(legend.title.size = 0.715, legend.text.size = 0.6,
            inner.margins = c(0.185, 0.13, 0.02, 0.08),
            legend.position = c(0.05, 0.02), legend.width = 0.5,
            bg.color = "#eaf5fa",
            main.title = "LISA Clusters for Broadband Internet Use Variable",
            main.title.size = 0.7, main.title.position = "center")
```
:::

```{r map4_final, echo=TRUE, fig.align='center'}
tmap_arrange(lisa_map_cluster1, lisa_map_cluster2)
```

### Geo-Demographic Classification in Bay Area

We create an overlay shapefile by spatially joining the Bay Area ZIP Code Areas with the Airbnb listings, then group by ZIP code to calculate all variables. We also join the average commercial electricity rate and the number of light-duty vehicles per ZIP code.

```{r geodemo, echo=TRUE, results='hide'}
#| code-fold: true
#| code-summary: "Code for calculating dataframe required for Geodemographic Classification"
overlay_listings <- st_join(bayarea_zip_vars, listings_bayarea_sf_transform, join = st_intersects) # spatial join to create the overlay shapefile
overlay_listings_zip <- overlay_listings %>%
  group_by(ZIP) %>%
  summarise(
    count_airbnb = sum(!is.na(price)),                    # count of non-NA price rows
    mean_price = mean(price, na.rm = TRUE),               # mean price
    mean_review = mean(number_of_reviews, na.rm = TRUE),  # mean number of reviews
    mean_min_nights = mean(minimum_nights, na.rm = TRUE), # mean minimum-nights stay
    Median_HhI = mean(estimate_MedianHhI, na.rm = TRUE),  # Median Household Income
    Internet_Use = mean(estimate_InternetUse, na.rm = TRUE) # Internet Usage
  ) %>%
  filter(count_airbnb != 0)
overlay_listings_zip <- overlay_listings_zip %>% left_join(vc_califorina, by = "ZIP") # joining the light-duty-vehicle counts
overlay_listings_zip <- overlay_listings_zip %>% left_join(electric_u, by = "ZIP") # joining the commercial electricity rates
overlay_listings_zip <- na.omit(overlay_listings_zip) # removing NA values
```

We create an 8-tuple vector for the multivariate classification:

```{r geo_names, echo=TRUE, results='hide'}
#| code-fold: true
#| code-summary: "Code for sub-setting 8-tuple vector"
airbnb_ra <- c('count_airbnb', 'mean_price', 'mean_review', 'mean_min_nights', 'Median_HhI', 'Internet_Use', 'No_of_Light_Duty_Vehicles', 'Combined_Avg_Comm_Rate') # geodemographic attributes (an 8-tuple) for multivariate classification
```

Subsetting the overlay listings shapefile without the geometry attribute:

```{r geo_pre, echo=TRUE, results='hide'}
#| code-fold: true
#| code-summary: "Code for sub-setting attribute information from shapefile"
overlay_listings_zip_NoGeo <- st_drop_geometry(overlay_listings_zip[, airbnb_ra, drop = FALSE]) # dropping the geometry column of the shapefile
overlay_listings_zip_NoGeo <- overlay_listings_zip_NoGeo[complete.cases(overlay_listings_zip_NoGeo), ] # removing NA values
```

K-means clustering:

```{r geodemo_setting, echo=TRUE, results='hide'}
set.seed(12345) # setting the seed
k6cls <- kmeans(overlay_listings_zip_NoGeo, centers = 6, iter.max = 1000) # k-means clustering into 6 clusters (each observation is an 8-tuple)
overlay_listings_zip$k6cls <- k6cls$cluster # assigning the cluster labels back to the overlay listings
```

```{r geodemo_centres, echo=TRUE}
k6cls$centers # centres of the 6 clusters: six 8-tuple vectors, one defining the centre of each cluster
```

By observing the centres of the different clusters, the main types of neighborhood can be identified: tourist-focused areas, residential areas, economically diverse areas and vehicle-dependent areas. Tourist-focused areas show a high Airbnb count, high internet usage and low median household income. Residential areas show high median household income, high internet usage and a high number of light-duty vehicles. Economically diverse areas show medium median household income (`Median_HhI`) and internet usage (`Internet_Use`). Vehicle-dependent areas show a high vehicle count.

The characteristics that help delineate this typology are the Airbnb metrics (`count_airbnb`, `mean_price`, `mean_review`, `mean_min_nights`), Median Household Income, Internet Usage (People), the number of light-duty vehicles and the average commercial electricity rate.

Using this classification, we can target the areas most in need. Economically weak neighborhoods, identified by low median household income and internet usage, can be prioritised for development programs and infrastructure improvements. To manage tourist activity in neighborhoods with a high Airbnb count, urban planning committees can focus on minimising the impact of tourism on local communities, for example through affordable housing. For transportation and urban planning, neighborhoods with a high vehicle count might benefit from improved public transport and the development of local amenities to reduce travel.
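One caveat about the `kmeans()` call earlier: it clusters the raw 8-tuples, so variables measured in large units (income, vehicle counts) dominate the Euclidean distances over variables such as `mean_min_nights`. A common variant, not what the analysis above does, standardises each column with `scale()` first; a minimal sketch on invented data:

```r
# Raw vs standardised k-means on invented toy data. With raw values the
# income column, being orders of magnitude larger, drives the clustering;
# after scale() both columns contribute comparably to the distances.
set.seed(12345)
toy <- data.frame(
  income = c(40000, 42000, 41000, 120000, 118000, 122000),
  nights = c(2, 3, 2, 6, 5, 7)
)
k_raw <- kmeans(toy,        centers = 2, iter.max = 1000)
k_std <- kmeans(scale(toy), centers = 2, iter.max = 1000) # z-scored columns
table(raw = k_raw$cluster, standardised = k_std$cluster)  # compare partitions
```

Whether to standardise is a modelling choice: raw clustering preserves the real-world magnitudes of the variables, while standardised clustering gives each of the eight attributes equal weight.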
Community services such as digital hubs might improve overall livelihoods in neighborhoods with low internet usage and varied median household income.

::: {style="overflow: auto; max-height: 700px;"}
```{r geodemo_map_pre, echo=TRUE, results='asis'}
#| code-fold: true
#| code-summary: "Code for creating Geo-demographic clustering map"
tmap_mode("plot")
map_cluster1 <- tm_basemap() + tm_shape(overlay_listings_zip) +
  tm_fill(col = "k6cls", palette = viridis(256), style = "cont", title = "Type of Cluster") +
  tm_compass(position = c(0.85, 0.88)) +
  tm_scale_bar(position = c(0.58, 0.02)) +
  tm_borders(col = "white", lwd = 0.2) +
  tm_layout(legend.title.size = 0.7, legend.text.size = 0.6,
            inner.margins = c(0.2, 0.13, 0.02, 0.08),
            legend.position = c(0.05, 0.02), legend.width = 0.5,
            bg.color = "#eaf5fa",
            main.title = "Type of Clusters in Bay Area", # map title set within tm_layout
            main.title.size = 0.7, main.title.position = "center")
map_veh <- tm_basemap() + tm_shape(overlay_listings_zip) +
  tm_fill(col = "No_of_Light_Duty_Vehicles", palette = "YlGn", n = 5, style = "fisher", title = "Vehicles per ZIP") +
  tm_compass(position = c(0.85, 0.88)) +
  tm_scale_bar(position = c(0.58, 0.02)) +
  tm_borders(col = "black", lwd = 0.2) +
  tm_layout(legend.title.size = 0.7, legend.text.size = 0.6,
            inner.margins = c(0.185, 0.13, 0.02, 0.08),
            legend.position = c(0.05, 0.02), legend.width = 0.5,
            bg.color = "#eaf5fa",
            main.title = "No. of Vehicles (Light-Duty) in Bay Area", # map title set within tm_layout
            main.title.size = 0.7, main.title.position = "center")
map_electric <- tm_basemap() + tm_shape(overlay_listings_zip) +
  tm_fill(col = "Combined_Avg_Comm_Rate", palette = "YlGn", n = 5, style = "fisher", title = "Commercial Electricity Rate") +
  tm_compass(position = c(0.85, 0.88)) +
  tm_scale_bar(position = c(0.58, 0.02)) +
  tm_borders(col = "black", lwd = 0.2) +
  tm_layout(legend.title.size = 0.7, legend.text.size = 0.6,
            inner.margins = c(0.185, 0.13, 0.02, 0.08),
            legend.position = c(0.05, 0.02), legend.width = 0.5,
            bg.color = "#eaf5fa",
            main.title = "Commercial Electrical Rate in Bay Area", # map title set within tm_layout
            main.title.size = 0.7, main.title.position = "center")
```
:::

```{r geodemo_map, echo=TRUE, fig.align='center'}
tmap_arrange(map_cluster1, map_veh, map_electric, nrow = 1, ncol = 3)
```

## Conclusion

Geodemographic classification supports data-informed decisions for policy making and societal improvement. Reality is far more nuanced: it demands an understanding of neighborhood dynamics, real-estate markets and other socio-economic factors. Still, by targeting areas on the basis of neighborhood characteristics, policymakers can address local issues effectively and enhance overall community well-being.

Although a multivariate clustering was performed, we neglected the time dimension, which introduces temporal inconsistencies. New forms of spatial data, such as the Airbnb listings, can also introduce biases. Beyond the insights this geodemographic classification provides, well-informed policy decisions would require more sophisticated approaches and real-time verification procedures.

::: callout-note
## Note:

I apologize for any references I may have inadvertently forgotten to mention in this case study. All my work is built upon the invaluable contributions of the giants in the data science field, on whose shoulders I stand.
I am deeply grateful for their insights and advancements, which have made this study possible.
:::

## References

1. GIS Geography (2023) *Choropleth Maps - A Guide to Data Classification*. Available at: <https://gisgeography.com/choropleth-maps-data-classification/> (Accessed: 08 January 2024).
2. Ross, S. (2009) *Introduction to Probability and Statistics for Engineers and Scientists.* 4th edn. London: Elsevier Academic Press.
3. United States Geological Survey (2023) *What is the State Plane Coordinate System? Can GPS provide coordinates in these values?* Available at: <https://www.usgs.gov/faqs/what-state-plane-coordinate-system-can-gps-provide-coordinates-these-values#:~:text=The%20State%20Plane%20Coordinate%20System%20(SPCS)%2C%20which%20is%20only,the%20state%27s%20size%20and%20shape.> (Accessed: 08 January 2024).
4. United States Census Bureau (2023) *ZIP Code Tabulation Areas (ZCTAs)*. Available at: <https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html> (Accessed: 08 January 2024).
5. ArcGIS Hub (2020) *USA State Plane Zones NAD83*. Available at: <https://hub.arcgis.com/datasets/esri::usa-state-plane-zones-nad83/about> (Accessed: 08 January 2024).
6. 'zip - Determining which US zipcodes map to more than one state or more than one city? - Geographic Information Systems' (2021) *Geographic Information Systems Stack Exchange*. Available at: <https://gis.stackexchange.com/questions/53918/determining-which-us-zipcodes-map-to-more-than-one-state-or-more-than-one-city> (Accessed: 08 January 2024).
7. UC Berkeley GeoData Repository (2004) *Bay Area ZIP Code Areas* \[Dataset\]. Available at: <https://geodata.lib.berkeley.edu/catalog/ark28722-s7888q> (Accessed: 08 January 2024).
8. California Open Data (2023) *Vehicle Fuel Type Count by Zip Code* \[Dataset\]. Available at: <https://data.ca.gov/dataset/vehicle-fuel-type-count-by-zip-code> (Accessed: 08 January 2024).
9. National Renewable Energy Laboratory (2022) *U.S. Electric Utility Companies and Rates: Look-up by Zipcode (2020)* \[Dataset\]. Available at: <https://catalog.data.gov/dataset/u-s-electric-utility-companies-and-rates-look-up-by-zipcode-2020> (Accessed: 08 January 2024).